Performance Analysis of Spark using k-means
Abstract
Big Data has long fascinated computer science enthusiasts around the world, and it has gained even more prominence recently with the continuous explosion of data from sources such as social media and the drive of tech giants to analyze their data more deeply. This paper compares Hadoop MapReduce and the more recently introduced Apache Spark, both of which provide a processing model for analyzing big data. Although both frameworks target big data processing, their performance varies significantly with the use case. This variability is what makes the two frameworks worth analyzing in the dynamic field of Big Data. In this paper we compare the two frameworks and provide a performance analysis using a standard machine learning algorithm for clustering (K-Means).
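As a rough illustration of the workload the abstract refers to, the sketch below writes K-Means (Lloyd's algorithm) as repeated map and reduce passes in plain, single-process Python. It is not the paper's implementation and uses neither the Spark nor the Hadoop API; the function names `assign` and `kmeans` and the `max_iter` parameter are assumptions made for this example. The point it shows is that every iteration scans the full dataset again, which is why an iterative algorithm such as K-Means is a natural benchmark for comparing a framework that can keep data in memory between iterations with one that runs each iteration as a separate job.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def assign(point, centroids):
    """Map step: return the index of the centroid nearest to the point."""
    return min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))


def kmeans(points, centroids, max_iter=20):
    """Lloyd's algorithm as repeated map (assign) and reduce (average) passes."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Map: key each point by its nearest centroid.
        groups = defaultdict(list)
        for p in points:
            groups[assign(p, centroids)].append(p)
        # Reduce: average each group; keep the old centroid if a group is empty.
        updated = [
            tuple(sum(xs) / len(groups[i]) for xs in zip(*groups[i])) if groups[i] else c
            for i, c in enumerate(centroids)
        ]
        if updated == centroids:  # converged: centroids no longer move
            break
        centroids = updated
    return centroids
```

For example, calling `kmeans([(0, 0), (0, 2), (10, 10), (10, 12)], [(0, 0), (10, 10)])` groups the first two and the last two points and returns the mean of each group as the final centroids.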
Similar resources
Mathematical Modeling and Analysis of Spark Erosion Machining Parameters of Hastelloy C-276 Using Multiple Regression Analysis (RESEARCH NOTE)
Electrical discharge machining has the capability of machining complicated shapes in electrically conductive materials independent of hardness of the work materials. This present article details the development of multiple regression models for envisaging the material removal rate and roughness of machined surface in electrical discharge machining of Hastelloy C276. The experimental runs are de...
Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics
MapReduce and Spark are two very popular open source cluster computing frameworks for large scale data analytics. These frameworks hide the complexity of task parallelism and fault-tolerance, by exposing a simple programming API to users. In this paper, we evaluate the major architectural components in MapReduce and Spark frameworks including: shuffle, execution model, and caching, by using a s...
Oil Quality Monitoring in Gasoline Spark Ignition Engine Using Oil Dielectric Measurement
Oil analysis is one of the main methods for monitoring the status of internal combustion engines and oils to ensure the lubricant's protective performance and engine health. This paper investigates oil condition monitoring in a gasoline engine using oil dielectric coefficient measurement. At first, a pectinate-type capacitive sensor is designed and manufactured so that engine oil can pass through the se...
Research of Performance of Distributed Platforms Based on Clustering Algorithm
With the continued development and application of Internet technology, ever larger amounts of data need to be processed. Spark is a versatile, high-performance parallel computing framework that can be applied to data mining. This paper is based on the parallelization of platforms’ K-means algorithm, by building a YARN cluster environment and making experiments to...
Shared Execution of Clustering Tasks
Clustering is a central problem in non-relational data analysis, with k-means being the most popular clustering technique. In various scenarios, it may be necessary to perform clustering over the same input data multiple times – with different values of k, different clustering attributes, or different initial centroids – before arriving at the final solution. In this paper, we propose algorithm...